✨ Write kubeadm control plane version file for workers to use to fetch the matching kubeadm binary#13433

Draft
AcidLeroy wants to merge 5 commits into kubernetes-sigs:main from AcidLeroy:kubeadm-version

Conversation

@AcidLeroy AcidLeroy commented Mar 10, 2026

What this PR does / why we need it

Kubernetes allows some version skew between the control plane and kubelets, but kubeadm's own skew policy requires the kubeadm binary used for kubeadm join to match the kubeadm version that created or last upgraded the cluster. A worker therefore cannot rely on an older kubeadm when the control plane is newer.

That conflicts with real Cluster API flows (e.g. scaling or remediating workers still on an older Kubernetes while the control plane has moved ahead), as discussed in #13315.

This PR:

  • Resolves the control plane Kubernetes version for join (from KubeadmControlPlane when available) and uses it when generating join bootstrap data so join config matches the cluster the node is joining. If the control plane object cannot be read while a controlPlaneRef is set, reconciliation fails and status conditions surface the error (no silent fallback to the Machine version in that case). When there is no control plane ref or the referenced object does not expose a version, the controller falls back to the Machine’s Kubernetes version as before.
  • Surfaces how join version was chosen on KubeadmConfig: the ControlPlaneKubernetesVersionAvailable condition stays True for both success paths, but Reason (and Message) distinguish version read from the control plane reference vs version taken from the Machine because the reference is unset or has no version—so operators can see at a glance whether the skew contract is being driven by the cluster control plane or the worker.
  • Exposes that version to user-defined bootstrap content via spec.files with contentFormat: go-template: files are rendered as Go text/template with data including KubernetesVersion, so operators can wire their own steps (scripts, package installs, downloads) to install a kubeadm binary that matches before kubeadm join runs—without CAPI prescribing a single install mechanism.
  • Adds RBAC so the kubeadm bootstrap controller can read KubeadmControlPlane where needed for version resolution.
  • Tests: unit coverage for go-template parse vs execute failures in spec.files, controller tests for the new condition reasons, and E2E coverage (KubeadmVersionOnJoin + clusterclass-quick-start-kubeadm-version) demonstrating the pattern end-to-end.
sequenceDiagram
    participant CP as Control plane (newer K8s)
    participant BC as Kubeadm bootstrap controller
    participant W as Worker (older image / kubelet)

    BC->>CP: Read KubeadmControlPlane.spec.version
    CP-->>BC: e.g. 1.35.0
    BC->>BC: Build join data + TemplateData.KubernetesVersion
    BC->>W: Render spec.files (go-template) e.g. fetch script with {{ .KubernetesVersion }}
    Note over W: preKubeadmCommands (operator-defined)
    W->>W: Install/fetch kubeadm matching CP version
    W->>CP: kubeadm join (binary matches policy)
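
A minimal sketch of the go-template rendering step described above, for illustration only. `TemplateData` and `renderFile` are hypothetical names, not the PR's actual types; only the `KubernetesVersion` field name and the use of Go `text/template` come from the description. Parse and execute failures are wrapped separately, mirroring the distinction the unit tests are said to cover.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// TemplateData is a hypothetical stand-in for the data passed to
// spec.files entries that use contentFormat: go-template.
type TemplateData struct {
	KubernetesVersion string
}

// renderFile parses content as a Go text/template and executes it with data.
// Parse and execute errors get different prefixes so callers can tell them apart.
func renderFile(name, content string, data TemplateData) (string, error) {
	tmpl, err := template.New(name).Parse(content)
	if err != nil {
		return "", fmt.Errorf("failed to parse template %q: %w", name, err)
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", fmt.Errorf("failed to execute template %q: %w", name, err)
	}
	return buf.String(), nil
}

func main() {
	// Hypothetical fetch script; the variable name is an illustration only.
	script := "#!/bin/bash\nKUBEADM_VERSION={{ .KubernetesVersion }}\n"
	out, err := renderFile("fetch-kubeadm.sh", script, TemplateData{KubernetesVersion: "v1.35.0"})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Print(out)
}
```

An unterminated action such as `{{ .KubernetesVersion` fails at parse time, while a reference to a field that does not exist on the data struct fails at execute time.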

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Related to #13315

/area bootstrap
/area test

@k8s-ci-robot k8s-ci-robot added the area/bootstrap Issues or PRs related to bootstrap providers label Mar 10, 2026
@k8s-ci-robot
Contributor

@AcidLeroy: The label(s) area/test cannot be applied, because the repository doesn't have them.

Details

In response to this:

What this PR does / why we need it:

When a worker node joins a cluster where the control plane has already been upgraded to a newer Kubernetes version, the kubeadm version skew policy can be violated: the joining node's kubeadm binary (matching its older OS image) is older than the control plane version. This becomes a real problem in scenarios like scaling up a MachineDeployment during an upgrade or when supporting workers pinned to older Kubernetes versions long-term.

This PR introduces a kubeadm version contract for worker nodes joining a cluster:

  1. Control plane version resolution for join: The kubeadm bootstrap controller now resolves the control plane version (from KubeadmControlPlane.spec.version) and uses it—instead of the joining machine's own version—when generating the kubeadm join configuration. For example, a v1.34 worker joining a v1.35 control plane will get join data generated for v1.35.

  2. Version file written to the node: A file is written at /run/cluster-api/kubeadm-version/version on every joining worker node (via both cloud-init and ignition). This file contains the control plane's Kubernetes version and acts as a contract: operators can provide a custom preKubeadmCommands script that reads this file and fetches/installs the matching kubeadm binary before kubeadm join runs.

  3. RBAC: The kubeadm bootstrap controller now has read access to KubeadmControlPlane resources to look up the control plane version.

  4. E2E tests: A new KubeadmVersionOnJoin e2e test validates the flow end-to-end using CAPD with a dedicated ClusterClass (clusterclass-quick-start-kubeadm-version) that embeds a fetch script reading the version file.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Related to #13315

/area bootstrap
/area test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 10, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chrischdi for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 10, 2026
@k8s-ci-robot
Contributor

Hi @AcidLeroy. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 10, 2026
@AcidLeroy AcidLeroy changed the title from "Write kubeadm control plane version file for workers to use to fetch the matching kubeadm" to "✨ Write kubeadm control plane version file for workers to use to fetch the matching kubeadm" Mar 10, 2026
@AcidLeroy AcidLeroy changed the title from "✨ Write kubeadm control plane version file for workers to use to fetch the matching kubeadm" to "✨ Write kubeadm control plane version file for workers to use to fetch the matching kubeadm binary" Mar 10, 2026
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 10, 2026
@AcidLeroy AcidLeroy marked this pull request as draft March 11, 2026 17:18
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 11, 2026

@zarcen zarcen left a comment

Thanks for putting all this together @AcidLeroy. Have some suggestions

@linux-foundation-easycla

linux-foundation-easycla bot commented Mar 11, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 11, 2026
Member

@neolit123 neolit123 left a comment

i posted some comments on slack:
https://kubernetes.slack.com/archives/C8TSNPY4T/p1773251442281699

i think CAPI can just write the ClusterConfiguration on disk too.

@AcidLeroy
Author

i posted some comments on slack: https://kubernetes.slack.com/archives/C8TSNPY4T/p1773251442281699

i think CAPI can just write the ClusterConfiguration on disk too.

@neolit123 I will provide an alternative solution by writing the ClusterConfiguration in a separate branch for now. If we find that we prefer that one, I'll merge it into this PR.

@AcidLeroy
Author

@neolit123 Is this sort of what you are thinking: https://github.com/AcidLeroy/cluster-api/pull/3/changes

@neolit123
Member

neolit123 commented Mar 12, 2026

@neolit123 Is this sort of what you are thinking: https://github.com/AcidLeroy/cluster-api/pull/3/changes

yes, sgtm, but up to maintainers to decide.

EDIT: in the slack thread we figured out that the CAPI v1beta2 ClusterConfiguration doesn't have the kubernetesVersion field, so my proposal is not useful.

@AcidLeroy
Author

Rather than hard-coding a file, we should look into providing a Go template file (kubeadmconfig); the kubeadm bootstrap config controller could then render that file with the version templated directly into it.

Look into resolveFiles in kubeadm and templating the version into the "fetch kubeadm version" script.

Author

This file might be in the gitignore file. Should double check.

@@ -0,0 +1,328 @@
/*
Copyright 2025 The Kubernetes Authors.
Author

2026

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 20, 2026
Comment on lines +843 to +867
// getControlPlaneVersionForJoin returns the control plane (cluster) version from the cluster's ControlPlaneRef,
// e.g. KubeadmControlPlane.spec.version. Returns empty string if the cluster has no ControlPlaneRef or the version
// cannot be read (e.g. control plane not found or does not support version). Used for worker join so that
// a 1.34 node uses kubeadm 1.35 when the control plane is at 1.35, for example.
func (r *KubeadmConfigReconciler) getControlPlaneVersionForJoin(ctx context.Context, scope *Scope) string {
	if !scope.Cluster.Spec.ControlPlaneRef.IsDefined() {
		return ""
	}
	controlPlane, err := external.GetObjectFromContractVersionedRef(ctx, r.Client, scope.Cluster.Spec.ControlPlaneRef, scope.Cluster.Namespace)
	if err != nil {
		scope.V(4).Info("Could not get control plane for version, falling back to machine version", "error", err)
		return ""
	}
	cpVersion, err := contract.ControlPlane().Version().Get(controlPlane)
	if err != nil {
		if !errors.Is(err, contract.ErrFieldNotFound) {
			scope.V(4).Info("Could not get control plane version, falling back to machine version", "error", err)
		}
		return ""
	}
	if cpVersion == nil {
		return ""
	}
	return *cpVersion
}

question: I notice this falls back to the machine version for any error (not found, permission denied, network issue, etc.). Is that intentional for all error types, or would it be worth distinguishing "control plane not found / field not present" (expected) from unexpected failures?

Just wondering whether masking unexpected errors here could make debugging harder down the road.

Author

Yeah, I think there are some error states here that we can requeue for and only fall back to machine version as an absolute last resort. I'll push some changes up shortly with an alternative to what I have here.

Author

@zjs, I introduced some changes to include another condition so that we can surface any issues with getting the CP version. In this case, we only fall back to the machine version if we absolutely have to, and we'll be able to see what the issue is via the status conditions. LMK what you think! Thanks!


Curious to see how others feel, but personally, I like it!

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 24, 2026

Labels

area/bootstrap Issues or PRs related to bootstrap providers cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

5 participants